Image style transform methods and apparatuses, devices and storage media

ABSTRACT

An image style transform method includes: acquiring an initial image to be subjected to style transform; inputting a gradient of the initial image to an image style transform model, and obtaining a feature map of the initial image in a gradient domain from the image style transform model, where the image style transform model is obtained by being trained in the gradient domain based on a pixel-wise loss and a perceptual loss; and performing image reconstruction according to the feature map of the initial image in the gradient domain to obtain a style image.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a continuation of International Application No. PCT/CN2018/117293 filed on Nov. 23, 2018, which claims priority to Chinese Patent Application No. 201810917979.7 filed on Aug. 13, 2018. The disclosures of these applications are hereby incorporated by reference in their entirety.

BACKGROUND

Deep learning-based image style transform is a new research problem of recent years. Although the image style transform problem has always existed, the German researcher Gatys used a neural network method for the first time in 2015, opening the door to creating image art styles with deep learning. The current technology does not optimize the style transform of face photos. For example, when the existing method is applied to a self-portrait image, the common shortcomings are: deformation of the face edge caused by the image style transform, and inconsistency of the face color.

SUMMARY

The present disclosure relates to image technologies, and in particular, to image style transform methods and apparatuses, devices and storage media.

In view of the above, embodiments of the present disclosure provide image style transform methods and apparatuses, devices and storage media for solving at least one problem existing in the prior art.

The technical solutions of the embodiments of the present disclosure are implemented as follows.

The embodiments of the present disclosure provide an image style transform method, including: acquiring an initial image to be subjected to style transform; inputting a gradient of the initial image to an image style transform model, and obtaining a feature map of the initial image in a gradient domain from the image style transform model, where the image style transform model is obtained by being trained in the gradient domain based on a pixel-wise loss and a perceptual loss; and performing image reconstruction according to the feature map of the initial image in the gradient domain to obtain a style image.

The embodiments of the present disclosure provide an image style transform apparatus, including: a memory storing processor-executable instructions; and a processor arranged to execute the stored processor-executable instructions to perform operations of: acquiring an initial image to be subjected to style transform; inputting a gradient of the initial image to an image style transform model, and obtaining a feature map of the initial image in a gradient domain from the image style transform model, where the image style transform model is obtained by being trained in the gradient domain based on a pixel-wise loss and a perceptual loss; and performing image reconstruction according to the feature map of the initial image in the gradient domain to obtain a style image.

The embodiments of the present disclosure provide an image style transform apparatus, including: an acquisition unit, configured to acquire an initial image to be subjected to style transform; an obtaining unit, configured to input a gradient of the initial image to an image style transform model, and obtain a feature map of the initial image in a gradient domain from the image style transform model, where the image style transform model is obtained by being trained in the gradient domain based on a pixel-wise loss and a perceptual loss; and a reconstruction unit, configured to perform image reconstruction according to the feature map of the initial image in the gradient domain to obtain a style image.

The embodiments of the present disclosure provide a computer device, including a memory and a processor, where the memory stores a computer program that can be run on the processor, and the processor executes the program to realize operations of the image style transform method.

The embodiments of the present disclosure provide a computer-readable storage medium, having a computer program stored thereon, where when the computer program is executed by a processor, operations of the image style transform method are implemented.

The embodiments of the present disclosure provide a computer program product, including a computer executable instruction, where the computer executable instruction is executed to implement operations of the image style transform method.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic structural diagram of a network architecture according to embodiments of the present disclosure;

FIG. 2A is a schematic flowchart for implementing an image style transform method according to embodiments of the present disclosure;

FIG. 2B is a schematic diagram of a download scenario according to embodiments of the present disclosure;

FIG. 3A is a schematic diagram I of an implementation scenario according to embodiments of the present disclosure;

FIG. 3B is a schematic diagram II of an implementation scenario according to embodiments of the present disclosure;

FIG. 4A is a schematic diagram III of an implementation scenario according to embodiments of the present disclosure;

FIG. 4B is a schematic diagram IV of an implementation scenario according to embodiments of the present disclosure;

FIG. 5A is a schematic structural diagram of a convolutional neural network model according to embodiments of the present disclosure;

FIG. 5B is a schematic structural diagram of a pixel-wise loss model according to embodiments of the present disclosure;

FIG. 6 is a schematic structural diagram of an image style transform apparatus according to embodiments of the present disclosure; and

FIG. 7 is a schematic diagram of a hardware entity of a computer device according to embodiments of the present disclosure.

DETAILED DESCRIPTION

The process of generating a style image using a neural network method is generally as follows: a neural network model such as VGG16 or VGG19 is used to separately perform image feature extraction on a content image and a style image, i.e., a content feature is extracted from the content image and a style feature is extracted from the style image. Loss functions are constructed for the content feature and the style feature, a loss value is calculated for a randomly initialized image, and the image is iteratively redrawn based on the fed-back loss to obtain a generated image. The generated image is similar to the content image in content, and similar to the style image in style. However, this algorithm requires training each time an image is generated, which takes a long time.

In a fast style transfer algorithm, a network is trained once, and any image may then be transformed to the style corresponding to the network; therefore, only a forward pass through the network is needed each time an image is generated, and the speed is fast.

The fast transfer algorithm generally contains two networks: an image transform network and a loss network. The image transform network is used to transform the image. The parameters of the image transform network are variable, while the parameters of the loss network are kept unchanged. A VGG-16 network trained on the ImageNet image library can be used as the loss network. The result image obtained by passing the content image through the image transform network, the style image, and the content image itself all pass through the loss network to extract a perceptual loss, and the image transform network is trained by using the perceptual loss. In the training phase, a large number of images are used to train the image transform network to obtain a model. In the output phase, the trained model is used to generate the output image. The resulting network is three orders of magnitude faster than the Gatys model at generating the generated image.

However, the current technology does not optimize the style transform of face photos. For example, when the existing method is applied to a self-portrait image, there are two obvious shortcomings: 1) the edge of the face may deviate from the original image, that is, the structural information of an output image changes; 2) the skin color of the face may be inconsistent with the original skin color, that is, the color information of the output image changes. The consequence is that, after stylization, the user will feel that the result does not look like himself. For example, the portrait of user A in the initial image has a round face, and after stylization, the portrait of user A in the output style image has a pointed face. For another example, the skin of user B is fair, and after stylization, the skin of user B in the output style image is dark. That is, how to better maintain the structural information and color information of the original initial image becomes a problem to be solved.

In order to solve the problems in the current technology, the embodiments of the present disclosure provide a Convolutional Neural Network (CNN) structure for image style transform based on an image gradient domain. Due to the edge protection of gradient domain learning, the image style transform network provided by the embodiments of the present disclosure can overcome the edge deformation disadvantage of the previous method. In the embodiments of the present disclosure, in the image reconstruction phase of image style transform, a term called color confidence is introduced to maintain the fidelity of the skin color of the resulting image. The image reconstruction phase utilizes both the structural information of the content image and the color information of the content image, which makes the result more natural.

In the embodiments of the present disclosure, the perceptual loss is directly used in the gradient domain for the first time, so that the learned style information is focused on the stroke rather than the color, making it more suitable for style transform tasks on faces.

In order to better understand the various embodiments of the present disclosure, the relevant terms are now explained:

Sampling operation: the sampling operation generally refers to a subsampling (down-sampling) operation. If the sampling object is a continuous signal, the continuous signal is subjected to the subsampling operation to obtain a discrete signal. For an image, the purpose of the subsampling operation may be to reduce the image for ease of calculation. The principle of the subsampling operation is: an image I having a size of M*N is subsampled s times to obtain an image having a resolution of (M/s)*(N/s), where s should be a common divisor of M and N. If an image in matrix form is considered, each s*s window of the original image is turned into one pixel, and the value of this pixel is the average value of all pixels in the window.
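
As a minimal illustration of this principle, the following Python sketch (the helper name subsample and the use of NumPy are choices of this sketch) averages each s*s window into one pixel:

```python
import numpy as np

def subsample(image: np.ndarray, s: int) -> np.ndarray:
    """Subsample an M*N image s times: each s*s window of the original
    image becomes one pixel whose value is the average of the window."""
    m, n = image.shape[0], image.shape[1]
    assert m % s == 0 and n % s == 0, "s should be a common divisor of M and N"
    blocks = image.reshape(m // s, s, n // s, s, *image.shape[2:])
    return blocks.mean(axis=(1, 3))  # average over each s*s window
```

For example, a 512*512 image subsampled with s=2 yields a 256*256 image, matching the (M/s)*(N/s) resolution above.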

Up-sampling operation: the inverse process of the subsampling operation, also called interpolation. For an image, a higher-resolution image can be obtained by the up-sampling operation. The principle of the up-sampling operation is: image magnification almost always uses interpolation, that is, new pixels are interposed between the original pixels by using a suitable interpolation algorithm based on the original image pixels.

Channel: this word has two different meanings. The first meaning is for a sample image (an image used as a training sample): the channel refers to a color channel (the number of color channels in the sample image), and the term color channel is used below to represent a channel of the sample image. The second meaning is the dimension of the output space, such as the number of output channels in a convolution operation, or the number of convolution kernels in each convolutional layer.

Color channel: an image is decomposed into one or more color components. In a single-color channel, one pixel point is represented by only one numerical value and can only represent gray scale, where 0 is black. In a three-color channel, if a Red Green Blue (RGB) color mode is used, the image is divided into three color channels of red, green and blue, which can represent colors, and all 0s represent black. In a four-color channel, an alpha channel is added to the RGB color mode to represent transparency, and alpha=0 represents fully transparent.

A CNN is a multi-layer supervised learning neural network. The convolutional layers and pooling/sampling layers of the hidden layers are the core modules implementing the feature extraction function of the CNN. The lower hidden layers of the CNN consist of convolutional layers and max-pooling sampling layers, alternately. The upper layers are the fully connected layers and the logistic regression classifier corresponding to the hidden layer and classifier of a conventional multi-layer perceptron. The input of the first fully connected layer is the feature image obtained by performing feature extraction with the convolutional layers and sub-sampling layers. The last output layer is a classifier that classifies the image by using logistic regression, Softmax regression or even a support vector machine. Each layer in the CNN consists of multiple maps, and each map consists of multiple neural units. All neural units of the same map share a convolution kernel (i.e., weights). The convolution kernel often represents a feature; for example, if a certain convolution kernel represents an arc, then when the convolution kernel is convolved over the entire image, a region with a larger convolution value is most likely an arc. The CNN generally uses convolutional layers and sampling layers alternately, that is, one convolutional layer is connected to one sampling layer, and the sampling layer is followed by another convolutional layer. Certainly, multiple convolutional layers may be connected to one sampling layer, so that the convolutional layers extract features, the features are then combined to form more abstract features, and finally descriptive features of the image objects are formed. The CNN can also be followed by a fully connected layer.

The CNN structure includes a convolutional layer, a down-sampling layer, and a fully connected layer. Each layer has multiple feature maps, each of which extracts one input feature by means of a convolution filter. Each feature map has multiple neurons. The convolutional layer is used because of an important property of the convolution operation: it can enhance the original signal features and reduce noise. The down-sampling layer is used because sub-sampling the image according to the principle of image local correlation reduces the amount of computation while maintaining the image rotation invariance. The fully connected layer adopts softmax full connection, and the obtained activation value is the image feature extracted by the CNN.

Activation function: a neuron is the basic unit of a multi-layer perceptron, and the activation function governs how it transmits activation. That is, for a neuron, the input is some or all of the CNN's input or the output of some or all previous layers; after calculation by the activation function, the result is obtained as the output of the neuron. The commonly used activation functions are the sigmoid function, the tanh function, and the Rectified Linear Unit (ReLu).

ReLu function: the formula is ReLu(x)=max(0, x). It can be seen from the graph of the ReLu function that ReLu has three main differences compared with other activation functions such as the sigmoid function: (1) unilateral suppression; (2) a relatively broad excitement boundary; and (3) sparse activation.

Pixel-wise Loss: assuming that I_est is an output result of the CNN and I_HR is the original high-resolution image, the pixel-wise loss emphasizes the matching of each corresponding pixel between the two images I_est and I_HR, which differs from the perceptual result of the human eye. In general, images trained by the pixel-wise loss are smoother and lack high-frequency information.

Perceptual Loss: assuming that I_est represents an output result of the CNN and I_HR represents the original high-resolution image, I_est and I_HR are respectively input into a differentiable function Φ, which avoids the requirement that the network output image be consistent with the original high-resolution image on a pixel-wise basis.

VGG model: the VGG model structure is simple and effective. The first few layers only use 3×3 convolution kernels to increase the network depth, the number of neurons in each layer is reduced by means of max pooling, and the last three layers are two fully connected layers of 4096 neurons and a softmax layer, respectively. “16” and “19” represent the number of layers in the network that need to update weights (i.e., the parameters to be learned). The weights of the VGG16 model and the VGG19 model are trained on ImageNet.

Model parameters may generally be understood as configuration variables inside the model. Values of the model parameters may be estimated using historical data or training samples; in other words, the model parameters are variables that can be automatically learned from historical data or training samples. To some extent, the model parameters have the following features: model parameters are required for model prediction; model parameter values define the model function; model parameters are obtained by data estimation or data learning; model parameters are generally not manually set by practitioners; model parameters are generally stored as a part of the learned model; and model parameters are generally estimated using an optimization algorithm, which is an efficient search over possible values of the parameters. In artificial neural networks, the weights and biases of a network model are generally referred to as model parameters.

Model hyper-parameters may generally be understood as configurations outside the model, whose values cannot be estimated from the data. To some extent, the features of model hyper-parameters are: model hyper-parameters are generally used in the process of estimating model parameters; model hyper-parameters are generally specified directly by the practitioner; model hyper-parameters may generally be set using heuristic methods; and model hyper-parameters are generally adjusted according to a given predictive modeling problem. In other words, the model hyper-parameters determine some properties of the model: if the hyper-parameters are different, the models are different. Here, "different models" may differ only slightly; for example, if the models are all CNN models but the numbers of layers are different, then the models are different, although they are all CNN models. In deep learning, the hyper-parameters include: the learning rate, the number of iterations, the number of layers, the number of neurons per layer, and so on.

The technical solutions of the present disclosure are further described below in detail with reference to the accompanying drawings and embodiments.

The embodiments of the present disclosure first provide a network architecture. FIG. 1 is a schematic structural diagram of a network architecture according to embodiments of the present disclosure. As shown in FIG. 1, the network architecture includes two or more electronic devices 11 to 1N and a server 31, where the electronic devices 11 to 1N interact with the server 31 by means of a network 21. The electronic device may be implemented as various types of computer devices having information processing capabilities; for example, the electronic device may include a mobile phone, a tablet computer, a desktop computer, a personal digital assistant, a navigator, a digital telephone, a television, and the like.

The embodiments of the present disclosure provide an image style transform method, which can effectively solve the problem that the structural information of the output image changes compared with the initial image. The method is applied to an electronic device, and the function implemented by the method can be implemented by a processor in the electronic device invoking program code. Certainly, the program code may be saved in a computer storage medium. Hence, the electronic device includes at least a processor and a storage medium.

FIG. 2A is a schematic flowchart for implementing an image style transform method according to embodiments of the present disclosure. As shown in FIG. 2A, the method includes operations S201 to S203.

In operation S201, an initial image to be subjected to style transform is acquired.

The image style transform method provided by the embodiments of the present disclosure may be embodied by means of a client (application) in the process of implementation. Referring to FIG. 2B, a user downloads the client from a server 31 to an electronic device 12 thereof. For example, the electronic device 12 sends a download request to the server 31 for downloading the client; the server 31 responds to the download request and sends a download response to the electronic device 12, where the download response carries the client, such as an Android Package (APK) in the Android system; the user then installs the downloaded client on the electronic device, and the electronic device runs the client, so that the image style transform method provided by the embodiments of the present disclosure may be implemented by the electronic device.

If operation S201 is implemented on the electronic device, the implementation process may be as follows: when the user selects a picture from an album, the client receives the user's operation of selecting a picture, that is, the client determines the selected picture as the initial image to be subjected to style transform; or, the user takes a photo with a camera of the electronic device or an external camera, and the client receives the user's operation of taking a photo, that is, the client determines the captured photo as the initial image to be subjected to style transform. Those skilled in the art will appreciate that other embodiments of this operation are possible.

In operation S202, a gradient of the initial image is input to an image style transform model, and a feature map of the initial image in a gradient domain is obtained from the image style transform model.

Here, the image style transform model is trained, and is obtained by being trained in the gradient domain based on a pixel-wise loss and a perceptual loss. In some embodiments, the image style transform model is obtained by using the pixel-wise loss and the perceptual loss as training targets in the gradient domain.

In operation S203, image reconstruction is performed according to the feature map of the initial image in the gradient domain to obtain a style image.

The style image is a reconstructed stylized image. In the process of implementation, the trained image style transform model may be local to the electronic device or at the server. When the trained image style transform model is local to the electronic device, the electronic device may be installed with the client, that is, with the trained image style transform model, so that, as shown in FIG. 3A, the electronic device obtains the initial image through operation S201, then the feature map (i.e., an output result) of the initial image in the gradient domain is obtained through operation S202, and finally the output style image is obtained through operation S203. It can be seen from the above process that after the electronic device is installed with the client, operations S201 to S203 are all executed locally on the electronic device. Finally, the electronic device outputs the obtained style image to the user.

In some embodiments, the trained image style transform model may also be located on the server. As shown in FIG. 3B, the electronic device transmits the initial image to the server, and the server receives the initial image sent by the electronic device, so that the server implements operation S201. In other words, if the foregoing method is implemented on the server, operation S201 includes: the server receives an initial image sent by the electronic device, that is, the server acquires an initial image to be subjected to style transform; then the server obtains the feature map of the initial image in the gradient domain through operation S202, and finally the output style image is obtained through operation S203. It can be seen from the above process that operations S201 to S203 are performed on the server, and finally the server may also send the style image to the electronic device, such that the electronic device outputs the style image to the user after receiving it. In the embodiments of the present disclosure, after the electronic device is installed with the client, the electronic device uploads the user's initial image, receives the style image sent by the server, and outputs the style image to the user.

In some embodiments, operations S201 to S203 may also be completed partially by the electronic device and partially by the server. For example, referring to FIG. 4A, operations S201 and S202 may be performed locally by the electronic device, the electronic device then transmits the feature map of the initial image in the gradient domain to the server, a style image is obtained after the server performs operation S203, and the style image is then sent to the electronic device for output. In another example, referring to FIG. 4B, operations S201 and S202 may be performed by the server, the server sends the feature map of the initial image in the gradient domain to the electronic device, a style image is obtained after the electronic device performs operation S203, and the style image is then output to the user.

In some embodiments, the method further includes: training the image style transform model, where the training target of the image style transform model is to minimize a total loss L_(total), where L_(total) is represented by the following equation:

L_(total) = αL_(feat) + βL_(pixel),

where L_(feat) represents the perceptual loss, L_(pixel) represents the pixel-wise loss, and the values of α and β are real numbers. The ratio of α to β is greater than 10 and less than 10⁵. For example, the value of α is 10,000, and the value of β is 1. It should be understood by those skilled in the art that the values of α and β may be set correspondingly according to a specific application scenario, and the embodiments of the present disclosure do not limit the values.

In some embodiments, the image style transform model includes a pixel-wise loss model and a perceptual loss model, where the pixel-wise loss model is obtained by taking minimization of the pixel-wise loss as the training target when being trained in the gradient domain, and the perceptual loss model is obtained by taking minimization of the perceptual loss as the training target when being trained in the gradient domain.

The training process of the pixel-wise loss model and the perceptual loss model includes operations S11 to S14.

In operation S11, a gradient of a training sample is determined. Assuming that I_(i) represents the i-th training sample, the gradient of the i-th training sample I_(i) is determined as ∂I_(i).

In operation S12, the gradient of the training sample is input to the pixel-wise loss model, and a sample output result of the training sample is obtained from the pixel-wise loss model; that is, the gradient ∂I_(i) of the i-th training sample I_(i) is input to the pixel-wise loss model F_(w), and a sample output result F_(w)(∂I_(i)) of the training sample is obtained from the pixel-wise loss model.

In operation S13, a gradient of a stylized reference image corresponding to the training sample is determined, where the stylized reference image may be an unsatisfactory stylized reference picture obtained by an existing stylization algorithm. Assuming that the stylized reference image corresponding to the training sample I_(i) is 𝓛(I_(i)), the gradient of the reference image is ∂𝓛(I_(i)).

In operation S14, the perceptual loss model is trained according to a first output feature map of the gradient of the reference image in a j-th convolutional layer of the perceptual loss model and a second output feature map of the sample output result in the j-th convolutional layer of the perceptual loss model. The j-th convolutional layer is any layer in the CNN model; when the CNN is VGG16, the j-th convolutional layer is the conv3-3 layer in the VGG16.

In some embodiments, the pixel-wise loss model includes a first convolutional layer set, an up-sampling layer, and a second convolutional layer set. Obtaining the sample output result from the pixel-wise loss model includes: inputting the gradient of the training sample to the first convolutional layer set to obtain a sample feature map; inputting the sample feature map to the up-sampling layer, and up-sampling the sample feature map to the pixel size of the initial image; and inputting the up-sampled sample feature map to the second convolutional layer set to obtain the sample output result.

In some embodiments, the training of the perceptual loss model according to the first output feature map of the gradient of the reference image in the j-th convolutional layer of the perceptual loss model and the second output feature map of the sample output result in the j-th convolutional layer of the perceptual loss model includes:

the perceptual loss model is trained using the following equation:

$L_{feat} = \frac{1}{C_{j}H_{j}W_{j}}\left\| \psi_{j}\left( \partial\mathcal{L}\left( I_{i} \right) \right) - \psi_{j}\left( F_{W}\left( \partial I_{i} \right) \right) \right\|,$

where ∂I_(i) represents the gradient of the i-th training sample, F_(w) represents the pixel-wise loss model, F_(w)(∂I_(i)) represents the output result of the gradient of the i-th training sample through the pixel-wise loss model, and ∂𝓛(I_(i)) represents the gradient of the stylized reference image of the i-th training sample; ψ_(j)( ) represents the output feature map of the j-th convolutional layer when the perceptual loss model adopts a convolutional neural network model, and C_(j), H_(j) and W_(j) respectively represent the number of channels, the height, and the width of the feature map corresponding to the j-th convolutional layer.

In some embodiments, when the CNN model is VGG16, the j-th convolutional layer is conv3-3.
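
As an illustrative sketch only, the equation above can be computed in Python with PyTorch; truncating torchvision's pretrained VGG-16 at feature index 15 to obtain the conv3-3 output, and the weights argument, are assumptions of this sketch that depend on the torchvision version:

```python
import torch
import torchvision

# Fixed loss network: pretrained VGG-16 truncated at conv3-3 (index 15 assumed).
vgg_conv3_3 = torchvision.models.vgg16(weights="IMAGENET1K_V1").features[:15].eval()
for p in vgg_conv3_3.parameters():
    p.requires_grad_(False)  # the loss network parameters are kept unchanged

def perceptual_loss(grad_ref: torch.Tensor, model_out: torch.Tensor) -> torch.Tensor:
    """L_feat between the gradient of the stylized reference image,
    grad_ref = dL(I_i), and the model output, model_out = F_w(dI_i),
    both of shape (N, 3, H, W)."""
    psi_ref = vgg_conv3_3(grad_ref)   # psi_j(dL(I_i))
    psi_out = vgg_conv3_3(model_out)  # psi_j(F_w(dI_i))
    c, h, w = psi_ref.shape[1:]
    return (psi_ref - psi_out).norm() / (c * h * w)
```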

In some embodiments, the training process of the pixel-wise loss model includes operations S21 to S24.

In operation S21, a gradient of a training sample is determined; in operation S22, the gradient of the training sample is used as an input of the pixel-wise loss model, and a sample output result is obtained from the pixel-wise loss model; in operation S23, a gradient of a stylized reference image corresponding to the training sample is determined; and in operation S24, the pixel-wise loss model is trained according to the gradient of the reference image and the sample output result. The training of the pixel-wise loss model according to the gradient of the reference image and the sample output result includes: training the pixel-wise loss model according to an absolute value of a difference between F_(w)(∂I_(i)) and the corresponding ∂𝓛(I_(i)) of each training sample, where ∂I_(i) represents the gradient of the i-th training sample, F_(w) represents the pixel-wise loss model, F_(w)(∂I_(i)) represents the output result of the gradient of the i-th training sample through the pixel-wise loss model F_(w), and ∂𝓛(I_(i)) represents the gradient of the stylized reference image of the i-th training sample.

In some embodiments, the training of the pixel-wise loss model according to an absolute value of a difference between F_(w)(∂I_(i)) and the corresponding ∂𝓛(I_(i)) of each training sample includes: training the pixel-wise loss model by using the following equation:

$L_{pixel} = \frac{1}{D}\sum_{i=0}^{D-1} \frac{1}{2}\left\| F_{W}\left( \partial I_{i} \right) - \partial\mathcal{L}\left( I_{i} \right) \right\|^{2},$

where ∂I_(i) represents the gradient of the i-th training sample, F_(w) represents the pixel-wise loss model, F_(w)(∂I_(i)) represents the output result of the gradient of the i-th training sample through the pixel-wise loss model, ∂𝓛(I_(i)) represents the gradient of the stylized reference image of the i-th training sample, and D represents the number of samples in the training sample set.
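
For comparison, a minimal sketch of the pixel-wise loss above under the same assumptions (PyTorch tensors of shape (D, C, H, W), the batch dimension playing the role of the D training samples):

```python
import torch

def pixel_loss(model_out: torch.Tensor, grad_ref: torch.Tensor) -> torch.Tensor:
    """L_pixel = (1/D) * sum_i 0.5 * ||F_w(dI_i) - dL(I_i)||^2, with
    model_out = F_w(dI) and grad_ref = dL(I), both of shape (D, C, H, W)."""
    d = model_out.shape[0]
    per_sample = 0.5 * (model_out - grad_ref).flatten(1).pow(2).sum(dim=1)
    return per_sample.sum() / d
```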

In some embodiments, the performing image reconstruction according to the feature map of the initial image in the gradient domain to obtain a style image includes: using an image that satisfies a structural similarity condition with the feature map of the initial image in the gradient domain as the style image. The structural similarity condition includes: a degree of structural difference between the style image and the initial image is less than a similarity threshold, or the degree of structural difference between the style image and the initial image is the smallest, where the degree of structural difference is the difference between the variation trend of the style image in the gradient domain and the feature map of the initial image in the gradient domain in at least one reference direction.

The reference direction may take the x and y directions of the image in the plane reference coordinate system. Certainly, more directions, or only one direction, may be used. The degree of difference may adopt the difference, the absolute value of the difference, or various mathematical deformations based on the difference (for example, the sum of the squared absolute differences in the x and y directions, i.e., ∥∂_(x)S−F_(w)(∂_(x)I)∥²+∥∂_(y)S−F_(w)(∂_(y)I)∥², where I represents the initial image, S represents the style image, and ∥ ∥ represents an absolute value sign).

In some embodiments, the performing image reconstruction according to the feature map of the initial image in the gradient domain to obtain a style image includes: performing image reconstruction according to ∥∂_(x)S−F_(w)(∂_(x)I)∥²+∥∂_(y)S−F_(w)(∂_(y)I)∥² to obtain the style image, where ∂_(x)I represents the gradient of the initial image in the x direction, F_(w)(∂_(x)I) represents the feature map of the gradient of the initial image in the x direction in the gradient domain through the image style transform model, ∂_(y)I represents the gradient of the initial image in the y direction, F_(w)(∂_(y)I) represents the feature map of the gradient of the initial image in the y direction in the gradient domain through the image style transform model, ∂_(x)S represents the gradient of the style image in the x direction, and ∂_(y)S represents the gradient of the style image in the y direction.

In some embodiments, the performing image reconstruction according to the feature map of the initial image in the gradient domain to obtain a style image includes: performing image reconstruction according to color information of the initial image and the feature map of the initial image in the gradient domain to obtain the style image. The performing image reconstruction according to the color information of the initial image and the feature map of the initial image in the gradient domain to obtain a style image includes: using an image that satisfies a structural similarity condition with the feature map of the initial image in the gradient domain and that satisfies a color similarity condition with the initial image as the style image.

In some embodiments, the method further includes: performing feature extraction on the initial image to obtain a face region in the initial image. Correspondingly, the performing image reconstruction according to the color information of the initial image and the feature map of the initial image in the gradient domain to obtain a style image includes: using an image that satisfies a structural similarity condition with the feature map of the initial image in the gradient domain, and that satisfies a color similarity condition with the face region in the initial image, as the style image. The color similarity condition is a condition that the color information satisfies, that is, the degree of difference between the colors of the style image and the initial image is less than a set value or is the minimum, where the degree of difference of the color is represented by a difference between the colors of the sampling points of the image to be processed and the target image, i.e., is represented by ∥S−I∥, where I represents the initial image, and S represents the style image.

In the embodiments of the present disclosure, in order not to change the color of the initial image or the skin color of the face, a color similarity condition is set, where the color similarity condition may apply to the color of the entire initial image, or may apply to the skin color of the face in the initial image. It should be noted that the above two conditions, i.e., the structural similarity condition and the color similarity condition, can theoretically be used separately, that is, only one condition is used to calculate the style image; or the two conditions can be used simultaneously, with corresponding coefficients (weights) assigned at the same time, for example, the value of the weight λ is a real number.

In some embodiments, the using an image that satisfies a structural similarity condition with the feature map of the initial image in the gradient domain and that satisfies a color similarity condition with the initial image as the style image includes: performing image reconstruction according to ∥S−I∥+λ{∥∂_(x)S−F_(w)(∂_(x)I)∥²+∥∂_(y)S−F_(w)(∂_(y)I)∥²} to obtain the style image, where I represents the initial image, S represents the style image, ∂_(x)I represents the gradient of the initial image in the x direction, F_(w)(∂_(x)I) represents the feature map of the gradient of the initial image in the x direction in the gradient domain through the image style transform model, ∂_(y)I represents the gradient of the initial image in the y direction, F_(w)(∂_(y)I) represents the feature map of the gradient of the initial image in the y direction in the gradient domain through the image style transform model, ∂_(x)S represents the gradient of the style image in the x direction, and ∂_(y)S represents the gradient of the style image in the y direction.

In some embodiments, the inputting a gradient of the initial image to an image style transform model, and obtaining a feature map of the initial image in a gradient domain from the image style transform model includes: operation S31, a gradient of the initial image in at least one reference direction is determined; and operation S32, the gradient in the at least one reference direction is input to the image style transform model, and a feature map of the initial image in the at least one reference direction in the gradient domain is correspondingly obtained from the image style transform model; correspondingly, image reconstruction is performed according to the feature map in the at least one reference direction in the gradient domain to obtain a style image.

In some embodiments, the at least one reference direction includes the x and y directions in a plane reference coordinate system. Correspondingly, the determining the gradient of the initial image in at least one reference direction includes: determining the gradients of the initial image in the x and y directions, respectively. The inputting the gradient in at least one reference direction to the image style transform model, and correspondingly obtaining a feature map of the initial image in the at least one reference direction in the gradient domain from the image style transform model includes: respectively inputting the gradients in the x and y directions to the image style transform model, and correspondingly obtaining feature maps of the initial image in the x and y directions in the gradient domain from the image style transform model. Correspondingly, the performing image reconstruction according to the feature map in the at least one reference direction in the gradient domain to obtain a style image includes: performing image reconstruction according to the feature maps in the x and y directions in the gradient domain to obtain the style image.

The technical solution of the embodiments of the present disclosure is introduced in three phases. The structure of the CNN model provided by the embodiments of the present disclosure is introduced in the first phase, the training process of the provided CNN model is introduced in the second phase, and the process of image reconstruction using the trained CNN, i.e., a method for image style transform of an initial image, is introduced in the third phase.

The First Phase: The Structure of the CNN Model

FIG. 5A is a schematic structural diagram of a CNN model according to embodiments of the present disclosure. As shown in FIG. 5A, the CNN model is composed of two parts:

The first part is a CNN 51 (a first CNN) to be trained, which takes the gradient of the self-portrait image as input, followed by successive convolutional layers and ReLu layers; the feature map is then up-sampled to the content image size by using the up-sampling operation, and finally the pixel-wise loss L_(pixel) is calculated with the gradient of the artistic style reference image. Taking the gradient of the self-portrait image as input includes: respectively using the gradient ∂_(x)I of the self-portrait image in the x direction and the gradient ∂_(y)I of the self-portrait image in the y direction as the input of the CNN.

In the CNN, each convolution filter of a convolutional layer is repeatedly applied to the entire receptive field, the input self-portrait image is convolved, and the result of the convolution constitutes a feature map of the input self-portrait image, so that local features of the self-portrait image are extracted. One characteristic of the CNN is max-pooling sampling, a nonlinear down-sampling method. It can be seen from the mathematical formula of max-pooling that max-pooling takes the maximum feature point in the neighborhood. After the image features are acquired by means of convolution, these features are used for classification. After the convolution feature map of the image is acquired, dimension reduction is performed on the convolution features by means of the max-pooling sampling method: the convolution features are divided into a number of disjoint regions, and the maximum (or average) features of these regions are used to represent the dimension-reduced convolution features. The function of the max-pooling sampling method is reflected in two aspects: (1) the max-pooling sampling method reduces the computational complexity from the upper hidden layer; and (2) these pooling units have translation invariance, so that even if the image has a small displacement, the extracted features remain unchanged. Due to the enhanced robustness to displacement, the max-pooling sampling method is an efficient sampling method for reducing the data dimension.

The second part is a VGG-16 network 52 (a second CNN) trained on ImageNet for calculating the perceptual loss L_(feat). The output of the conv3-3 layer of the VGG-16 is actually used to calculate the perceptual loss.

Finally, the sum of the L_(pixel) of the first part and the L_(feat) of the second part is the final total target function to be calculated (i.e., the total loss L_(total)).

In one embodiment, the total target function L_(total) may be calculated using the following formula (3-1): L_(total)=αL_(feat)+βL_(pixel) (3-1), where the values of α and β are real numbers. For example, α and β are respectively set to integers in training.

The image gradient is briefly introduced below. The image gradient is a method for describing the differences between image pixels and can be used as a feature of an image to represent the image. From the perspective of mathematics, the image gradient refers to the first-order derivative of the pixels. The following equations (3-2) and (3-3) may be used to represent the gradient ∂_(x)I of the image in the x direction and the gradient ∂_(y)I in the y direction, respectively:

∂_(x) I=I(x, y)−I(x−1, y)   (3-2).

∂_(y) I=I(x, y)−I(x, y−1)   (3-3).

It should be noted that there are many calculation methods for the gradient of the image itself, as long as the difference between the pixels can be described. Those skilled in the art should understand that the gradient of the image is not necessarily calculated with the foregoing equations (3-2) and (3-3); in fact, other equations are often used. For example, if superimposed convolution operations are used to calculate the image gradient, the template used is generally called a gradient operator. Common gradient operators include the Sobel operator, the Robinson operator, the Laplace operator and so on.
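
Equations (3-2) and (3-3) translate directly into Python/NumPy; indexing the array as img[y, x] and leaving the border pixels at zero are choices of this sketch, not requirements of the equations:

```python
import numpy as np

def image_gradients(img: np.ndarray):
    """Return (d_x I, d_y I) per equations (3-2) and (3-3)."""
    img = img.astype(np.float64)          # avoid wrap-around on integer images
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, 1:] = img[:, 1:] - img[:, :-1]  # d_x I = I(x, y) - I(x-1, y)
    gy[1:, :] = img[1:, :] - img[:-1, :]  # d_y I = I(x, y) - I(x, y-1)
    return gx, gy
```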

The Second Phase: The Training Process of the First Part of the CNN

First, the training samples are determined. Assuming that D groups of training pairs (I₀, 𝓛(I₀)), (I₁, 𝓛(I₁)), . . . , (I_(D−1), 𝓛(I_(D−1))) are collected, I_(i) represents the i-th original image, and 𝓛(I_(i)) represents an unsatisfactory stylized reference image obtained from the i-th original image I_(i) by using an existing stylization algorithm.

The definition of the pixel-wise loss L_(pixel) calculated by the first part in FIG. 5A is as shown in equation (4-1):

$L_{pixel} = \frac{1}{D}\sum_{i = 0}^{D - 1} \frac{1}{2}\left\| F_{W}\left( \partial I_{i} \right) - \partial\mathcal{L}\left( I_{i} \right) \right\|^{2} = \frac{1}{D}\sum_{i = 0}^{D - 1}\left\{ \frac{1}{2}\left\| F_{W}\left( \partial_{x}I_{i} \right) - \partial_{x}\mathcal{L}\left( I_{i} \right) \right\|^{2} + \frac{1}{2}\left\| F_{W}\left( \partial_{y}I_{i} \right) - \partial_{y}\mathcal{L}\left( I_{i} \right) \right\|^{2} \right\}. \quad (4\text{-}1)$

In equation (4-1), ∂_(x) represents a gradient or gradient representation of the i-th original image I_(i) in the x direction, and ∂_(y) represents a gradient or gradient representation in the y direction. ∂I_(i) represents the gradient of the original image, ∂_(x)I_(i) represents the gradient of the original image I_(i) in the x direction, and ∂_(y)I_(i) represents the gradient of the original image I_(i) in the y direction. F_(w) represents the CNN model of the first part; therefore, F_(w)(∂I_(i)) represents the result of the gradient of the i-th original image I_(i) through the CNN model, F_(w)(∂_(x)I_(i)) represents the result of the gradient of the i-th original image I_(i) in the x direction through the CNN model, and F_(w)(∂_(y)I_(i)) represents the result of the gradient of the i-th original image I_(i) in the y direction through the CNN model. ∂𝓛(I_(i)) represents the gradient of the stylized reference image of the i-th original image I_(i), ∂_(x)𝓛(I_(i)) represents the gradient of the stylized reference image of the i-th original image I_(i) in the x direction, and ∂_(y)𝓛(I_(i)) represents the gradient of the stylized reference image of the i-th original image I_(i) in the y direction.

The definition of the perceptual loss L_(feat) calculated by the second part in FIG. 5A is as shown in equation (4-2):

$L_{feat} = \frac{1}{C_{j}H_{j}W_{j}}\left\| \psi_{j}\left( \partial\mathcal{L}\left( I_{i} \right) \right) - \psi_{j}\left( F_{W}\left( \partial I_{i} \right) \right) \right\| = \frac{1}{C_{j}H_{j}W_{j}}\left\{ \left\| \psi_{j}\left( \partial_{x}\mathcal{L}\left( I_{i} \right) \right) - \psi_{j}\left( F_{W}\left( \partial_{x}I_{i} \right) \right) \right\| + \left\| \psi_{j}\left( \partial_{y}\mathcal{L}\left( I_{i} \right) \right) - \psi_{j}\left( F_{W}\left( \partial_{y}I_{i} \right) \right) \right\| \right\}. \quad (4\text{-}2)$

In equation (4-2), ψ_(j)( ) represents the output feature map of the j-th convolutional layer of the VGG-16 network, and C_(j), H_(j), and W_(j) respectively represent the number of channels, the height and the width of the feature map corresponding to the j-th convolutional layer. In the process of implementation, the conv3-3 layer of the VGG-16 is used. The meanings of F_(w)(∂I_(i)) and ∂𝓛(I_(i)) are the same as those in the first part: F_(w)(∂I_(i)) represents the result of the gradient of the original image through the network, and ∂𝓛(I_(i)) represents the gradient of the stylized reference image of the original image.

The total target function is the sum of the perceptual loss L_(feat) and the pixel-wise loss L_(pixel):

L_(total) = αL_(feat) + βL_(pixel)   (4-3).

In equation (4-3), the values of α and β are real numbers. For example, α and β are respectively set to integers in training. In the training, α and β are set to 10,000 and 1, respectively, and 100K iterations are performed with an NVIDIA Titan X GPU. The adam optimization method is used to optimize the target function in equation (4-3). The learning rate is 10⁻⁸ in the first 50K iterations, and the learning rate is set to 10⁻⁹ in the later 50K iterations. It should be noted that some modifications may be made to equations (4-1) and (4-2) by those skilled in the art during the implementation process. For equation (4-1), as long as these modifications can still indicate a pixel-wise loss, for example, the ½ in equation (4-1) may be modified to another value, such as ¼ or ⅓, etc., the square of the absolute value in equation (4-1) may be modified to an absolute value, or the square of the absolute value in equation (4-1) may be modified to the square root of the absolute value.
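
The schedule above can be sketched as follows in PyTorch; transform_net, loader, perceptual_loss and pixel_loss are placeholders of this sketch standing for the first-part CNN F_w, a batch iterator over (∂I_i, ∂𝓛(I_i)) pairs, and equations (4-2) and (4-1), respectively:

```python
import torch

alpha, beta = 10_000.0, 1.0  # weights used in the training described above
optimizer = torch.optim.Adam(transform_net.parameters(), lr=1e-8)

for it in range(100_000):    # 100K iterations in total
    if it == 50_000:         # drop the learning rate for the later 50K iterations
        for group in optimizer.param_groups:
            group["lr"] = 1e-9
    grad_in, grad_ref = next(loader)      # one batch of (dI_i, dL(I_i)) pairs
    model_out = transform_net(grad_in)    # F_w(dI_i)
    loss = alpha * perceptual_loss(grad_ref, model_out) \
         + beta * pixel_loss(model_out, grad_ref)       # equation (4-3)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```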

The Third Phase: The Image Reconstruction Process

When a new image is input, such as a new self-portrait image, the output stylized image is determined by using the following equation (5) to obtain the corresponding style image.

∥S−I∥+λ{∥∂_(x)S−F_(w)(∂_(x)I)∥²+∥∂_(y)S−F_(w)(∂_(y)I)∥²}   (5).

In equation (5), I represents the new self-portrait image, i.e., the initial image, S represents the style image corresponding to the new self-portrait image, ∂_(x)I represents the gradient of the self-portrait image in the x direction, and F_(w)(∂_(x)I) represents the output of the gradient of the self-portrait image in the x direction through the trained model; similarly, ∂_(y)I represents the gradient of the self-portrait image in the y direction, F_(w)(∂_(y)I) represents the output of the gradient of the self-portrait image in the y direction through the trained model, ∂_(x)S represents the gradient of the style image in the x direction, and ∂_(y)S represents the gradient of the style image in the y direction. In the foregoing equation, the term ∥S−I∥ performs image reconstruction by using the color information of the content image, the term {∥∂_(x)S−F_(w)(∂_(x)I)∥²+∥∂_(y)S−F_(w)(∂_(y)I)∥²} performs image reconstruction by using the structural information of the content image, and λ represents the weight between the two pieces of information. In the process of implementation, λ is 10. By minimizing the foregoing equation, S may be obtained, i.e., the style image of the new self-portrait image.
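
Because S enters equation (5) only through its pixels and its gradients, S can be obtained by gradient descent on the pixels directly. The following PyTorch sketch assumes tensors of shape (C, H, W), computes ∂_x and ∂_y by forward differences, and takes the precomputed network outputs F_w(∂_xI) and F_w(∂_yI) as inputs fw_gx and fw_gy (cropped to the finite-difference shapes); the step count and learning rate are illustrative choices of this sketch:

```python
import torch

def reconstruct(I, fw_gx, fw_gy, lam=10.0, steps=500, lr=0.01):
    """Minimize ||S - I|| + lam * (||d_x S - F_w(d_x I)||^2
                                  + ||d_y S - F_w(d_y I)||^2) over S."""
    S = I.clone().requires_grad_(True)   # start from the content image
    opt = torch.optim.Adam([S], lr=lr)
    for _ in range(steps):
        gx = S[:, :, 1:] - S[:, :, :-1]  # d_x S, shape (C, H, W-1)
        gy = S[:, 1:, :] - S[:, :-1, :]  # d_y S, shape (C, H-1, W)
        loss = (S - I).norm() + lam * ((gx - fw_gx).pow(2).sum()
                                       + (gy - fw_gy).pow(2).sum())
        opt.zero_grad()
        loss.backward()
        opt.step()
    return S.detach()
```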

In some embodiments, the first part, i.e., the CNN 51 to be trained (the first CNN), may be the CNN shown in FIG. 5B. FIG. 5B is a schematic structural diagram of a pixel-wise loss model according to embodiments of the present disclosure. As shown in FIG. 5B, the model includes the following.

In an input layer 501, a gradient of the self-portrait image in the x or y direction is used as an input. It should be noted that h represents the height of the gradient of the self-portrait image in the x or y direction, and w represents the width of the gradient of the self-portrait image in the x or y direction. For a self-portrait image I, the gradient of the self-portrait image I in the x direction is ∂_(x)I and the gradient of the self-portrait image I in the y direction is ∂_(y)I, and each color channel (or color component) of ∂_(x)I and ∂_(y)I is used as an input. If a Red-Green-Blue (RGB) color model is used, there are three color channels. Correspondingly, for a self-portrait image, there are 6 inputs, namely, ∂_(x)I in the R color channel, ∂_(x)I in the G color channel, ∂_(x)I in the B color channel, ∂_(y)I in the R color channel, ∂_(y)I in the G color channel, and ∂_(y)I in the B color channel. The input layer is followed by a conv1+ReLu1 layer, a conv2+ReLu2 layer, a conv3+ReLu3 layer, a conv4+ReLu4 layer, a conv5+ReLu5 layer, a conv6+ReLu6 layer, and a conv7+ReLu7 layer.

After passing through the convolutional layers and the ReLu layers, the output result is a feature map 502 with a height of h/r, a width of w/r, and c channels, where r is a coefficient, and the values of r and c are related to the model hyper-parameters of the convolutional neural network model in the embodiments of the present disclosure. In the embodiments of the present disclosure, the model hyper-parameters include the size of a convolution kernel, the stride of the convolution kernel, and the padding of the input feature map. In general, the number of convolution kernels determines the number of channels c of the output feature map.

In the up-sampling layer, the inputs are 511 to 51C, and the outputs are 521 to 52C. The output feature map is disassembled according to the number of channels c, so that c feature maps 511 to 51C are obtained, and each of the feature maps 511 to 51C is up-sampled to the size of the initial image. The initial image mentioned in the input layer 501 is a self-portrait image, and the size of the self-portrait image is h*w; thus, the sizes of the up-sampled images 521 to 52C output by the up-sampling layer are also h*w. In the up-sampling layer, the output corresponding to the input 511 is 521, the output corresponding to the input 512 is 522, and so on, and the output corresponding to the input 51C is 52C.

A synthesis layer has inputs 521 to 52C and outputs a feature map 531: the up-sampled images 521 to 52C are combined to obtain the feature map 531. An output layer has an input of 531 and an output of 541. The feature map 531 is convolved and excited; that is, the feature map 531 is input to conv8, ReLu8, and conv9 to obtain an output 541, and the size of the output 541 is the size h*w of the original image.

It should be noted that the convolutional neural network model shown in FIG. 5B can be used to replace the network portion 53 in FIG. 5A. In the embodiments of the present disclosure, the convolution process before up-sampling has seven layers, respectively conv1 to conv7, and the excitation process before up-sampling also has seven layers, respectively ReLu1 to ReLu7. The seven convolutional layers (conv1 to conv7) may be regarded as the first convolutional layer set of the pixel-wise loss model. Certainly, the seven convolutional layers together with the seven excitation layers (ReLu1 to ReLu7) may also be regarded as the first convolutional layer set of the pixel-wise loss model. After the up-sampling, there are also two convolutional layers, respectively conv8 and conv9. After the up-sampling, there is also one layer of excitation process, i.e., an excitation layer ReLu8. The two convolutional layers (conv8 and conv9) may be regarded as the second convolutional layer set of the pixel-wise loss model. Certainly, the two convolutional layers together with the one excitation layer (ReLu8) may also be regarded as the second convolutional layer set of the pixel-wise loss model.

Those skilled in the art should understand that the number of convolutional layers before the up-sampling (the number of convolutional layers in the first convolutional layer set) may vary, for example, five layers, nine layers, ten layers, or tens of layers. Correspondingly, the number of excitation layers before the up-sampling (the number of excitation layers in the first convolutional layer set) may also vary, for example, five layers, six layers, nine layers, 15 layers, etc. In the embodiments, before the up-sampling, each convolutional layer is followed by an excitation layer, that is, convolutional layers and excitation layers alternate one by one before the up-sampling. Those skilled in the art should understand that the alternation pattern of the convolutional layers and the excitation layers may also vary, for example, two convolutional layers are followed by one excitation layer, and then one convolutional layer is followed by two excitation layers. In the embodiments of the present disclosure, the excitation function used by the excitation layers is ReLu. In some embodiments, the excitation layers may also adopt other excitation functions, such as the sigmoid function. The pooling layer is not shown in the embodiment of FIG. 5B; in some embodiments, a pooling layer may also be added. After the up-sampling, the number of convolutional layers (the number of convolutional layers in the second convolutional layer set), and the order of the convolutional layers and the excitation layers, may vary.
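
Putting the description of FIG. 5B together, the following PyTorch sketch mirrors the first convolutional layer set (conv1 to conv7 with ReLu1 to ReLu7), the up-sampling back to the input size (up-sampling the whole feature map at once is functionally equivalent to up-sampling the c channels separately), and the second convolutional layer set (conv8, ReLu8, conv9). The channel count c, the kernel sizes, and the single stride-2 layer giving r=2 are illustrative assumptions; the disclosure does not specify them:

```python
import torch
import torch.nn as nn

class PixelWiseLossNet(nn.Module):
    """First conv set (conv1..conv7, each followed by a ReLu), up-sampling
    back to the h*w input size, then the second conv set (conv8, ReLu8, conv9)."""
    def __init__(self, c: int = 64):
        super().__init__()
        layers, in_ch = [], 3                 # a gradient map with 3 color channels
        for i in range(7):                    # conv1..conv7 alternating with ReLu1..ReLu7
            stride = 2 if i == 0 else 1       # one stride-2 layer gives r = 2 (assumed)
            layers += [nn.Conv2d(in_ch, c, 3, stride=stride, padding=1), nn.ReLU()]
            in_ch = c
        self.first_set = nn.Sequential(*layers)
        self.conv8 = nn.Conv2d(c, c, 3, padding=1)
        self.relu8 = nn.ReLU()
        self.conv9 = nn.Conv2d(c, 3, 3, padding=1)

    def forward(self, x):                     # x: gradient of the image, (N, 3, h, w)
        f = self.first_set(x)                 # feature map 502: (N, c, h/r, w/r)
        f = nn.functional.interpolate(        # up-sample back to the h*w input size
            f, size=x.shape[-2:], mode="bilinear", align_corners=False)
        return self.conv9(self.relu8(self.conv8(f)))  # output 541: (N, 3, h, w)
```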

Based on the foregoing embodiments, the embodiments of the present disclosure provide an image style transform apparatus, including various units, and various modules included in the units, which may be implemented by a processor in an electronic device, and certainly may also be implemented by a specific logic circuit. In the process of implementation, the processor may be a Central Processing Unit (CPU), a Micro Processing Unit (MPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), etc.

FIG. 6 is a schematic structural diagram of an image style transform apparatus according to embodiments of the present disclosure. As shown in FIG. 6, the apparatus 600 includes an acquisition unit 601, an obtaining unit 602, and a reconstruction unit 603.

The acquisition unit 601 is configured to acquire an initial image to be subjected to style transform. The obtaining unit 602 is configured to input a gradient of the initial image to an image style transform model, and obtain a feature map of the initial image in a gradient domain from the image style transform model, where the image style transform model is obtained by being trained in the gradient domain based on a pixel-wise loss and a perceptual loss. The reconstruction unit 603 is configured to perform image reconstruction according to the feature map of the initial image in the gradient domain to obtain a style image.

In some embodiments, the apparatus further includes: a training unit, configured to train the image style transform model, where a training target of the image style transform model is that a total loss L_(total) is minimum, where L_(total) is represented by the following equation:

L_(total)=αL_(feat)+βL_(pixel),

wherein L_(feat) represents the perceptual loss, L_(pixel) represents the pixel-wise loss, and values of α and β are real numbers.

In some embodiments, the ratio of α to β is greater than 10 and less than 10⁵.
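As an illustration only, the following sketch combines the two losses with assumed weights α=1000 and β=1, whose ratio lies in the stated range; the concrete values are not specified by the disclosure.

def total_loss(l_feat, l_pixel, alpha=1000.0, beta=1.0):
    # L_total = alpha * L_feat + beta * L_pixel; here alpha/beta = 1000,
    # which lies inside the stated range (10, 10**5). The values are assumptions.
    return alpha * l_feat + beta * l_pixel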

In some embodiments, the image style transform model includes a pixel-wise loss model and a perceptual loss model, where the pixel-wise loss model is obtained by taking minimization of the pixel-wise loss as a training target when being trained in the gradient domain, and the perceptual loss model is obtained by taking minimization of the perceptual loss as a training target when being trained in the gradient domain.

In some embodiments, the training unit includes: a first input module, configured to input a gradient of a training sample to the pixel-wise loss model, and obtain a sample output result of the training sample from the pixel-wise loss model; a first determining module, configured to determine a gradient of a stylized reference image corresponding to the training sample; and a first training module, configured to train the perceptual loss model according to a first output feature map of the gradient of the reference image in a j-th convolutional layer of the perceptual loss model and a second output feature map of the sample output result in the j-th convolutional layer of the perceptual loss model.

In some embodiments, the first training module trains the perceptual loss model by using the following equation:

${L_{feat} = \frac{1}{C_{j}H_{j}W_{j}}\left\| {\psi_{j}\left( {\partial\mathcal{L}\left( I_{i} \right)} \right)} - {\psi_{j}\left( {F_{W}\left( {\partial I_{i}} \right)} \right)} \right\|_{2}^{2}},$

where ∂I_(i) represents a gradient of an i-th training sample, F_(w) represents the pixel-wise loss model, F_(w)(∂I_(i)) represents an output result of the gradient of the i-th training sample through the pixel-wise loss model, ∂ℒ(I_(i)) represents a gradient of a stylized reference image of the i-th training sample; ψ_(j)( ) represents an output feature map of the j-th convolutional layer when the perceptual loss model adopts a convolutional neural network model, and C_(j), H_(j), and W_(j) respectively represent the number of channels, the height, and the width of the feature map corresponding to the j-th convolutional layer.

In some embodiments, when the convolutional neural network model is VGG16, the j-th convolutional layer is conv3-3.
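For illustration, the following sketch realizes ψ_(j) with torchvision's VGG16 truncated at conv3-3 and evaluates L_(feat) as defined above. The channel-repetition of single-channel gradient maps, the pretrained ImageNet weights, and the omission of input normalization are assumptions made for brevity, not requirements of the disclosure.

import torch
from torchvision.models import vgg16

# VGG16 truncated at conv3-3 (features index 14 in torchvision's layout).
_psi_j = vgg16(weights="DEFAULT").features[:15].eval()
for p in _psi_j.parameters():
    p.requires_grad_(False)

def perceptual_loss(ref_grad, out_grad):
    # VGG16 expects 3-channel input; repeat single-channel gradient maps.
    if ref_grad.shape[1] == 1:
        ref_grad = ref_grad.repeat(1, 3, 1, 1)
        out_grad = out_grad.repeat(1, 3, 1, 1)
    feat_ref = _psi_j(ref_grad)    # psi_j(∂L(I_i)), reference-image gradient
    feat_out = _psi_j(out_grad)    # psi_j(F_W(∂I_i)), pixel-wise model output
    c, h, w = feat_ref.shape[1:]
    return ((feat_ref - feat_out) ** 2).sum() / (c * h * w)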

In some embodiments, the training unit further includes: a second determining module, configured to determine a gradient of a training sample; a second input module, configured to use the gradient of the training sample as an input of the pixel-wise loss model, and obtain a sample output result from the pixel-wise loss model; a third determining module, configured to determine a gradient of a stylized reference image corresponding to the training sample; and a second training module, configured to train the pixel-wise loss model according to the gradient of the reference image and the sample output result.

In some embodiments, the pixel-wise loss model includes a first convolutional layer set, an up-sampling layer, and a second convolutional layer set. The training the pixel-wise loss model according to the gradient of the reference image and the sample output result includes: inputting the gradient of the training sample to the first convolutional layer set to obtain a sample feature map; inputting the sample feature map to the up-sampling layer, and up-sampling the sample feature map to the pixel size of the initial image; and inputting the up-sampled sample feature map to the second convolutional layer set to obtain a sample output result.
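This three-stage forward pass corresponds to the forward method of the PixelWiseLossNet sketched after FIG. 5B above; a usage example with illustrative, assumed shapes:

model = PixelWiseLossNet()                  # from the earlier sketch
sample_grad = torch.randn(1, 1, 32, 32)     # gradient of a training sample
output = model(sample_grad, out_size=(64, 64))  # sample output at the image size h*w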

In some embodiments, the second training module is configured to train the pixel-wise loss model according to an absolute value of a difference between F_(w)(∂I_(i)) and corresponding ∂ℒ(I_(i)) of each training sample, where ∂I_(i) represents the gradient of the i-th training sample, F_(w) represents the pixel-wise loss model, F_(w)(∂I_(i)) represents an output result of the gradient of the i-th training sample through the pixel-wise loss model F_(w), and ∂ℒ(I_(i)) represents the gradient of the stylized reference image of the i-th training sample.

In some embodiments, the second training module is configured to train the pixel-wise loss model by using the following equation:

${L_{pixel} = \frac{1}{D}{\sum\limits_{i = 0}^{D - 1}{\frac{1}{2}\left\| {F_{W}\left( {\partial I_{i}} \right)} - {\partial{\mathcal{L}\left( I_{i} \right)}} \right\|^{2}}}},$

where ∂I_(i) represents the gradient of the i-th training sample, F_(w) represents the pixel-wise loss model, F_(w)(∂I_(i)) represents an output result of the gradient of the i-th training sample through the pixel-wise loss model F_(w), ∂ℒ(I_(i)) represents the gradient of the stylized reference image of the i-th training sample, and D represents a number of samples in a training sample set.
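A minimal sketch of this objective, averaging (1/2)‖F_(w)(∂I_(i))−∂ℒ(I_(i))‖² over the D training samples; the model and the pairing of sample gradients with reference gradients are assumptions carried over from the earlier sketches.

def pixel_loss(model, sample_grads, ref_grads, out_size):
    # sample_grads[i] is ∂I_i; ref_grads[i] is ∂L(I_i), as tensors.
    total = 0.0
    for grad_i, ref_i in zip(sample_grads, ref_grads):
        out_i = model(grad_i, out_size)               # F_W(∂I_i)
        total = total + 0.5 * ((out_i - ref_i) ** 2).sum()
    return total / len(sample_grads)                  # average over D samples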

In some embodiments, the reconstruction unit is configured to use an image that satisfies a structural similarity condition to a feature map of the initial image in the gradient domain as the style image.

In some embodiments, satisfying the structural similarity condition to the feature map of the initial image in the gradient domain includes: a degree of structural difference between the style image and the initial image is less than a similarity threshold, or the degree of structural difference between the style image and the initial image is the smallest, where the degree of structural difference is a difference in variation trend, in at least one reference direction, between the style image in the gradient domain and the feature map of the initial image in the gradient domain.

In some embodiments, the reconstruction unit is configured to perform image reconstruction according to ∥∂_(x)S−F_(w)(∂_(x)I)∥²+∥∂_(y)S−F_(w)(∂_(y)I)∥² to obtain the style image, where ∂_(x)I represents the gradient of the initial image in the x direction, F_(w)(∂_(x)I) represents a feature map of the gradient of the initial image in the x direction in the gradient domain through the image style transform model, ∂_(y)I represents the gradient of the initial image in the y direction, F_(w)(∂_(y)I) represents a feature map of the gradient of the initial image in the y direction in the gradient domain through the image style transform model, ∂_(x)S represents the gradient of the style image in the x direction, and ∂_(y)S represents the gradient of the style image in the y direction.
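A sketch of this gradient-domain reconstruction as an iterative least-squares fit of S's gradients to the model outputs. The optimizer, learning rate, iteration count, and forward-difference gradient operator are assumptions, and the targets are assumed cropped to the forward-difference shapes.

import torch

def reconstruct(target_gx, target_gy, image_shape, steps=500, lr=0.1):
    # Minimize ||∂x S − F_W(∂x I)||² + ||∂y S − F_W(∂y I)||² over S.
    S = torch.zeros(image_shape, requires_grad=True)
    opt = torch.optim.Adam([S], lr=lr)
    for _ in range(steps):
        gx = S[..., :, 1:] - S[..., :, :-1]       # forward difference ∂x S
        gy = S[..., 1:, :] - S[..., :-1, :]       # forward difference ∂y S
        loss = ((gx - target_gx) ** 2).sum() + ((gy - target_gy) ** 2).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return S.detach()

Because only gradients are constrained, S is determined up to a constant offset; the color terms of the following embodiments remove this ambiguity.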

In some embodiments, the reconstruction unit is configured to perform image reconstruction according to color information of the initial image and the feature map of the initial image in the gradient domain to obtain a style image.

In some embodiments, the reconstruction unit is configured to use an image that satisfies a structural similarity condition to a feature map of the initial image in the gradient domain and an image that satisfies a color similarity condition to the initial image as the style image.

In some embodiments, the apparatus further includes: an extraction unit, configured to perform feature extraction on the initial image to obtain a face region in the initial image. Correspondingly, the reconstruction unit is configured to use an image that satisfies a structural similarity condition to a feature map of the initial image in the gradient domain, and an image that satisfies a color similarity condition to a face region in the initial image, as the style image.

In some embodiments, the reconstruction unit is configured to perform image reconstruction according to ∥S−I∥+λ{∥∂_(x)S−F_(w)(∂_(x)I)∥²+∥∂_(y)S−F_(w)(∂_(y)I)∥²} to obtain a style image, where I represents the initial image, S represents the style image, ∂_(x)I represents a gradient of the initial image in the x direction, F_(w)(∂_(x)I) represents a feature map of the gradient of the initial image in the x direction in the gradient domain through the image style transform model, ∂_(y)I represents the gradient of the initial image in the y direction, F_(w)(∂_(y)I) represents a feature map of the gradient of the initial image in the y direction in the gradient domain through the image style transform model, ∂_(x)S represents the gradient of the style image in the x direction, and ∂_(y)S represents the gradient of the style image in the y direction.
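Extending the previous sketch with the color-fidelity term: the disclosure writes ∥S−I∥, and a squared penalty is used here for smooth optimization; the weight λ and the other hyperparameters are assumed values.

import torch

def reconstruct_with_color(I, target_gx, target_gy, lam=10.0, steps=500, lr=0.1):
    # Minimize ||S − I||² + λ(||∂x S − F_W(∂x I)||² + ||∂y S − F_W(∂y I)||²).
    S = I.clone().requires_grad_(True)   # start from the initial image's colors
    opt = torch.optim.Adam([S], lr=lr)
    for _ in range(steps):
        gx = S[..., :, 1:] - S[..., :, :-1]
        gy = S[..., 1:, :] - S[..., :-1, :]
        loss = ((S - I) ** 2).sum() + lam * (
            ((gx - target_gx) ** 2).sum() + ((gy - target_gy) ** 2).sum())
        opt.zero_grad()
        loss.backward()
        opt.step()
    return S.detach()

Restricting the color term to a detected face region, as in the preceding embodiment, amounts to masking the ∥S−I∥² term to that region.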

In some embodiments, the obtaining unit comprises: a fourth determining module, configured to determine a gradient of the initial image in at least one reference direction; and an obtaining module, configured to input the gradient in the at least one reference direction to the image style transform model, and correspondingly obtain a feature map of the initial image in the at least one reference direction in the gradient domain from the image style transform model. Correspondingly, the reconstruction unit is configured to perform image reconstruction according to the feature map in the at least one reference direction in the gradient domain to obtain a style image.

In some embodiments, the at least one reference direction includes x and y directions in a plane reference coordinate system. Correspondingly, the fourth determining module is configured to determine the gradients of the initial image in the x and y directions, respectively; the obtaining module is configured to respectively input the gradients in the x and y directions to the image style transform model, and correspondingly obtain feature maps of the initial image in the x and y directions in the gradient domain from the image style transform model; and the reconstruction unit is configured to perform image reconstruction according to the feature maps in the x and y directions in the gradient domain to obtain a style image.
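For illustration, the x- and y-direction gradients can be computed by forward differences; the disclosure does not fix the gradient operator, so this choice is an assumption.

import torch

def image_gradients(img):
    # img: tensor of shape (N, C, H, W); returns (∂x I, ∂y I).
    gx = img[..., :, 1:] - img[..., :, :-1]   # horizontal forward difference
    gy = img[..., 1:, :] - img[..., :-1, :]   # vertical forward difference
    return gx, gy

Feeding the resulting gx and gy through the trained model and passing the outputs to a reconstruction routine such as the reconstruct sketch above ties the determining, obtaining, and reconstruction steps together.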

The description of the foregoing apparatus embodiments is similar to the description of the foregoing method embodiments, and has advantages similar to those of the method embodiments. For the technical details that are not disclosed in the apparatus embodiments of the present disclosure, please refer to the description of the method embodiments of the present disclosure. It should be noted that in the embodiments of the present disclosure, when implemented in the form of a software functional module and sold or used as an independent product, the image style transform method may also be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present disclosure essentially, or the part contributing to the prior art, may be implemented in the form of a software product. The computer software product is stored in a storage medium, and includes several instructions for instructing a computer device (which may be a personal computer or a server, etc.) to perform all or some of the methods in the embodiments of the present disclosure. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random-Access Memory (RAM), a magnetic disk, or an optical disk. In this case, the embodiments of the present disclosure are not limited to any particular combination of hardware and software. Correspondingly, the embodiments of the present disclosure provide a computer device, including a memory and a processor, where the memory stores a computer program that can be run on the processor, and the processor executes the program to realize operations of the image style transform method.

The embodiments of the present disclosure provide a computer-readable storage medium, having a computer program stored thereon, where when the computer program is executed by a processor, operations of the image style transform method are implemented. The embodiments of the present disclosure further provide a computer program product, including a computer executable instruction, where the computer executable instruction is executed to implement operations of the image style transform method. It should be noted here that the description of the foregoing storage medium and apparatus embodiments is similar to the description of the foregoing method embodiments, and has advantages similar to those of the method embodiments. For the technical details that are not disclosed in the storage medium and apparatus embodiments of the present disclosure, please refer to the description of the method embodiments of the present disclosure.

It should be noted that FIG. 7 is a schematic diagram of a hardware entity of a computer device according to the embodiments of the present disclosure. As shown in FIG. 7, the hardware entity of the computer device 700 includes: a processor 701, a communication interface 702, and a memory 703, where the processor 701 generally controls the overall operation of the computer device 700. The communication interface 702 may enable the computer device to communicate with other terminals or servers over a network. The memory 703 is configured to store instructions and applications executable by the processor 701, and may also cache data to be processed or already processed by the processor 701 and each module of the computer device 700 (e.g., image data, audio data, voice communication data, and video communication data), and may be realized by a flash memory (FLASH) or a RAM.

It should be understood that the phrase “one embodiment” or “an embodiment” mentioned in the description means that particular features, structures, or characteristics relating to the embodiments are included in at least one embodiment of the present disclosure. Therefore, the phrase “in one embodiment” or “in an embodiment” appearing throughout the description does not necessarily refer to the same embodiment. In addition, these particular features, structures, or characteristics may be combined in one or more embodiments in any suitable manner. It should be understood that, in the various embodiments of the present disclosure, the magnitude of the serial numbers of the foregoing processes does not imply an order of execution; the execution sequence of each process should be determined by its function and internal logic, and is not intended to limit the implementation process of the embodiments of the present disclosure. The serial numbers of the embodiments of the present disclosure are merely for a descriptive purpose, and do not represent the advantages or disadvantages of the embodiments.

It should be noted that the terms “comprising”, “including”, or any other variant thereof herein are intended to encompass a non-exclusive inclusion, such that a process, method, article, or apparatus that includes a series of elements includes not only those elements, but also other elements not explicitly listed, or elements that are inherent to such a process, method, article, or apparatus. Without more limitations, an element defined by the phrase “including one . . . ” does not exclude the presence of additional identical elements in the process, method, article, or apparatus that includes the element.

In some embodiments provided by the present disclosure, it should be understood that the disclosed device and method may be implemented in other manners. The device embodiments described above are merely illustrative. For example, the division of the units is only a logical function division, and in actual implementation, another division manner may be possible; for example, multiple units or components may be combined, or may be integrated into another system, or some features may be ignored or not executed. In addition, the coupling, or direct coupling, or communicational connection between the components shown or discussed may be indirect coupling or communicational connection by means of some interfaces, devices or units, and may be electrical, mechanical or in other forms.

In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, or each may be separately used as one unit, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of hardware plus software functional units.

Alternatively, when implemented in the form of a software functional module and sold or used as an independent product, the integrated unit of the present disclosure may also be stored in a computer-readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present disclosure essentially, or the part contributing to the prior art, may be implemented in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for instructing a computer device (which may be a personal computer or a server, etc.) to perform all or some of the methods in the embodiments of the present disclosure. Moreover, the foregoing storage media include various media capable of storing program code, such as a mobile storage device, a ROM, a magnetic disk, or an optical disk.

The above are only implementation modes of the present disclosure, but the scope of protection of the present disclosure is not limited thereto. Any changes or substitutions that a person skilled in the art could easily conceive of within the technical scope disclosed in the present disclosure should be included in the scope of protection of the present disclosure. Therefore, the scope of protection of the present disclosure should be determined by the scope of protection of the appended claims.

1. An image style transform method, comprising: acquiring an initial image to be subjected to style transform; inputting a gradient of the initial image to an image style transform model, and obtaining a feature map of the initial image in a gradient domain from the image style transform model, wherein the image style transform model is obtained by being trained in the gradient domain based on a pixel-wise loss and a perceptual loss; and performing image reconstruction according to the feature map of the initial image in the gradient domain to obtain a style image.
2. The method according to claim 1, further comprising: training the image style transform model, wherein a training target of the image style transform model is that a total loss L_(total) is minimum, wherein L_(total) is represented by the following equation: L_(total)=αL_(feat)+βL_(pixel), where L_(feat) represents the perceptual loss, L_(pixel) represents the pixel-wise loss, and values of α and β are real numbers.
3. The method according to claim 1, wherein the image style transform model comprises a pixel-wise loss model and a perceptual loss model, wherein the pixel-wise loss model is obtained by taking minimization of the pixel-wise loss as a training target when being trained in the gradient domain, and the perceptual loss model is obtained by taking minimization of the perceptual loss as a training target when being trained in the gradient domain.
4. The method according to claim 3, wherein the training process of the pixel-wise loss model and the perceptual loss model comprises: inputting a gradient of a training sample to the pixel-wise loss model, and obtaining a sample output result of the training sample from the pixel-wise loss model; determining a gradient of a stylized reference image corresponding to the training sample; and training the perceptual loss model according to a first output feature map of the gradient of the reference image in a j-th convolutional layer of the perceptual loss model and a second output feature map of the sample output result in the j-th convolutional layer of the perceptual loss model.
5. The method according to claim 4, wherein the training the perceptual loss model according to a first output feature map of the gradient of the reference image in a j-th convolutional layer of the perceptual loss model and a second output feature map of the sample output result in the j-th convolutional layer of the perceptual loss model comprises: training the perceptual loss model by using the following equation: ${L_{feat} = \frac{1}{C_{j}H_{j}W_{j}}\left\| {\psi_{j}\left( {\partial\mathcal{L}\left( I_{i} \right)} \right)} - {\psi_{j}\left( {F_{W}\left( {\partial I_{i}} \right)} \right)} \right\|_{2}^{2}},$ where ∂I_(i) represents a gradient of an i-th training sample, F_(w) represents the pixel-wise loss model, F_(w)(∂I_(i)) represents an output result of the gradient of the i-th training sample through the pixel-wise loss model, ∂ℒ(I_(i)) represents a gradient of a stylized reference image of the i-th training sample; ψ_(j)( ) represents the output feature map of the j-th convolutional layer when the perceptual loss model adopts a convolutional neural network model, and C_(j), H_(j), and W_(j) respectively represent the number of channels, the height, and the width of the feature map corresponding to the j-th convolutional layer.
6. The method according to claim 3, wherein the training process of the pixel-wise loss model comprises: using the gradient of the training sample as an input of the pixel-wise loss model, and obtaining a sample output result from the pixel-wise loss model; determining the gradient of the stylized reference image corresponding to the training sample; and training the pixel-wise loss model according to the gradient of the reference image and the sample output result.
7. The method according to claim 4, wherein the pixel-wise loss model comprises a first convolutional layer set, an up-sampling layer and a second convolutional layer set, and wherein the training the pixel-wise loss model according to the gradient of the reference image and the sample output result comprises: inputting the gradient of the training sample to the first convolutional layer set to obtain the sample feature map; inputting the sample feature map to the up-sampling layer, and up-sampling the sample feature map to a pixel size of the initial image; and inputting the up-sampled sample feature map to the second convolutional layer set to obtain the sample output result.
8. The method according to claim 7, wherein the training the pixel-wise loss model according to the gradient of the reference image and the sample output result comprises: training the pixel-wise loss model according to an absolute value of a difference between F_(w)(∂I_(i)) and corresponding ∂ℒ(I_(i)) of each training sample; where ∂I_(i) represents the gradient of the i-th training sample, F_(w) represents the pixel-wise loss model, F_(w)(∂I_(i)) represents an output result of the gradient of the i-th training sample through the pixel-wise loss model F_(w), and ∂ℒ(I_(i)) represents the gradient of the stylized reference image of the i-th training sample.
9. The method according to claim 8, wherein the training the pixel-wise loss model according to an absolute value of a difference between F_(w)(∂I_(i)) and corresponding ∂ℒ(I_(i)) of each training sample comprises: training the pixel-wise loss model by using the following equation: ${L_{pixel} = \frac{1}{D}{\sum\limits_{i = 0}^{D - 1}{\frac{1}{2}\left\| {F_{W}\left( {\partial I_{i}} \right)} - {\partial{\mathcal{L}\left( I_{i} \right)}} \right\|^{2}}}},$ where ∂I_(i) represents the gradient of the i-th training sample, F_(w) represents the pixel-wise loss model, F_(w)(∂I_(i)) represents an output result of the gradient of the i-th training sample through the pixel-wise loss model F_(w), ∂ℒ(I_(i)) represents the gradient of the stylized reference image of the i-th training sample, and D represents a number of samples in a training sample set.
10. The method according to claim 1, wherein the performing image reconstruction according to the feature map of the initial image in the gradient domain to obtain a style image comprises: using an image that satisfies a structural similarity condition to the feature map of the initial image in the gradient domain as the style image; wherein satisfying the structural similarity condition to the feature map of the initial image in the gradient domain comprises: a degree of structural difference between the style image and the initial image is less than a similarity threshold, or the degree of structural difference between the style image and the initial image is the smallest, wherein the degree of structural difference is a difference in variation trend, in at least one reference direction, between the style image in the gradient domain and the feature map of the initial image in the gradient domain.
11. The method according to claim 10, wherein the performing image reconstruction according to the feature map of the initial image in the gradient domain to obtain a style image comprises: performing image reconstruction according to ∥∂_(x)S−F_(w)(∂_(x)I)∥²+∥∂_(y)S−F_(w)(∂_(y)I)∥² to obtain the style image, where ∂_(x)I represents the gradient of the initial image in an x direction, F_(w)(∂_(x)I) represents a feature map of the gradient of the initial image in the x direction in the gradient domain through the image style transform model, ∂_(y)I represents the gradient of the initial image in a y direction, F_(w)(∂_(y)I) represents a feature map of the gradient of the initial image in the y direction in the gradient domain through the image style transform model, ∂_(x)S represents the gradient of the style image in the x direction, and ∂_(y)S represents the gradient of the style image in the y direction.
 12. The method according to claim 1, wherein the performing image reconstruction according to the feature map of the initial image in the gradient domain to obtain a style image comprises: performing image reconstruction according to color information of the initial image and the feature map of the initial image in the gradient domain to obtain the style image.
13. The method according to claim 12, wherein the performing image reconstruction according to color information of the initial image and the feature map of the initial image in the gradient domain to obtain a style image comprises: using an image that satisfies a structural similarity condition to the feature map of the initial image in the gradient domain and an image that satisfies a color similarity condition to the initial image as the style image.
14. The method according to claim 12, further comprising: performing feature extraction on the initial image to obtain a face region in the initial image; and correspondingly, the performing image reconstruction according to color information of the initial image and the feature map of the initial image in the gradient domain to obtain a style image comprises: using an image that satisfies a structural similarity condition to the feature map of the initial image in the gradient domain, and an image that satisfies a color similarity condition to a face region in the initial image, as the style image.
15. The method according to claim 13, wherein the using an image that satisfies a structural similarity condition to the feature map of the initial image in the gradient domain and an image that satisfies a color similarity condition to the initial image as the style image comprises: performing image reconstruction according to ∥S−I∥+λ{∥∂_(x)S−F_(w)(∂_(x)I)∥²+∥∂_(y)S−F_(w)(∂_(y)I)∥²} to obtain the style image, where I represents the initial image, S represents the style image, ∂_(x)I represents the gradient of the initial image in the x direction, F_(w)(∂_(x)I) represents a feature map of the gradient of the initial image in the x direction in the gradient domain through the image style transform model, ∂_(y)I represents the gradient of the initial image in the y direction, F_(w)(∂_(y)I) represents a feature map of the gradient of the initial image in the y direction in the gradient domain through the image style transform model, ∂_(x)S represents the gradient of the style image in the x direction, and ∂_(y)S represents the gradient of the style image in the y direction.
16. The method according to claim 1, wherein the inputting a gradient of the initial image to an image style transform model, and obtaining a feature map of the initial image in a gradient domain from the image style transform model comprises: determining a gradient of the initial image in at least one reference direction; inputting the gradient in the at least one reference direction to the image style transform model, and correspondingly obtaining a feature map of the initial image in the at least one reference direction in the gradient domain from the image style transform model; and correspondingly performing image reconstruction according to the feature map in the at least one reference direction in the gradient domain to obtain the style image.
 17. The method according to claim 16, wherein the at least one reference direction comprises x and y directions in a plane reference coordinate system, and the method correspondingly comprises: determining the gradients of the initial image in the x and y directions, respectively; respectively inputting the gradients in the x and y directions to the image style transform model, and correspondingly obtaining a feature map of the initial image in the x and y directions in the gradient domain from the image style transform model; and correspondingly performing image reconstruction according to the feature map in the x and y directions in the gradient domain to obtain the style image.
18. An image style transform apparatus, comprising: a memory storing processor-executable instructions; and a processor arranged to execute the stored processor-executable instructions to perform operations of: acquiring an initial image to be subjected to style transform; inputting a gradient of the initial image to an image style transform model, and obtaining a feature map of the initial image in a gradient domain from the image style transform model, wherein the image style transform model is obtained by being trained in the gradient domain based on a pixel-wise loss and a perceptual loss; and performing image reconstruction according to the feature map of the initial image in the gradient domain to obtain a style image.
19. The apparatus according to claim 18, wherein the processor is arranged to execute the stored processor-executable instructions to further perform an operation of: training the image style transform model, wherein a training target of the image style transform model is that a total loss L_(total) is minimum, wherein L_(total) is represented by the following equation: L_(total)=αL_(feat)+βL_(pixel), where L_(feat) represents the perceptual loss, L_(pixel) represents the pixel-wise loss, and values of α and β are real numbers.
20. A non-transitory computer-readable storage medium having stored thereon computer executable instructions that, when executed by a processor, cause the processor to implement an image style transform method, the method comprising: acquiring an initial image to be subjected to style transform; inputting a gradient of the initial image to an image style transform model, and obtaining a feature map of the initial image in a gradient domain from the image style transform model, wherein the image style transform model is obtained by being trained in the gradient domain based on a pixel-wise loss and a perceptual loss; and performing image reconstruction according to the feature map of the initial image in the gradient domain to obtain a style image.